Record: Causal SLOT + Pre-quant TTT — val_bpb 1.0846 (3-seed mean) #1306
resouer wants to merge 1 commit into openai:main
3-seed mean 1.0846 (std 0.0007). Beats merged SOTA (1.1147) by 0.030.
Novel: provably causal eval-time delta optimization (causal SLOT). Unlike standard SLOT (PR openai#1240 proved 100% causal violation), delta is optimized using only backward-looking loss from already-scored positions. Combined with 6-epoch pre-quant AdamW TTT and coprime-stride multi-shard data loading.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nai#1303 at 0.9462 - logs/daily_research.md: full daily report; PR openai#771 rejected confirmed, n-gram PRs status, leaderboard unchanged (1.1147), headline PR openai#1303 (0.9462 bpb, legality unconfirmed), PR openai#1306 Causal SLOT (-0.009) + Pre-quant TTT (-0.022), new paper scan (LaCT, pQuant, SLOT paper) - CLAUDE.md v7.1: updated key reference PRs (openai#1303, openai#1306), corrected SLOT technique table (standard SLOT disputed, Causal SLOT lower-risk alternative, Pre-quant TTT novel entry) https://claude.ai/code/session_01AUKKvYMVeeWQzfTKocVaJZ
I think this PR would benefit from separating the legality story for the two adaptation mechanisms more explicitly. To me, the causal SLOT part is the strongest piece of the argument, because the writeup says the delta objective is restricted to context-only / already-scored positions. That is at least directionally aligned with the current README /
The part that still seems underspecified is:
Under the current rule framing, I think reviewers will want to know how that piece satisfies the same four conditions, especially:
So I think the most helpful clarification would be a small compliance section that treats the two components separately:
I’m not saying the causal SLOT argument is weak; in fact, that part reads much more plausibly Track-B-compliant than standard SLOT. I just think the pre-quant TTT piece needs a more concrete score-before-update explanation than the PR body currently gives.
Closing in favor of PR #1350 (L-BFGS Causal SLOT, 1.0046 BPB). This submission (1.0846 BPB) used AdamW causal SLOT (-0.009 delta). PR #1350 replaces AdamW with L-BFGS in logit space (-0.087 delta), achieving 1.0046 BPB, a significant improvement on the same causal framework. All other techniques (pre-quant TTT, coprime loader) are carried forward.
Summary
3-seed mean val_bpb: 1.0846 (std 0.0007) | ~15.95 MB | 8xH100 SXM | ~551s eval
Merged SOTA (PR #1019, 3-seed mean): 1.88218 nats. This run: 1.83126 nats. Delta: -0.051 nats. Clears the 0.005-nat threshold.
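The arithmetic behind the threshold claim can be checked directly from the figures quoted in this PR. The sketch below only re-derives the deltas. The variable names are illustrative, and the assumption that the nats and bpb figures were measured on the same validation tokens is mine, not stated in the PR.

```python
# Figures quoted in the Summary and the commit message of this PR.
sota_nats = 1.88218      # merged SOTA (PR #1019), 3-seed mean
run_nats = 1.83126       # this run, 3-seed mean
threshold_nats = 0.005   # required improvement over merged SOTA

sota_bpb = 1.1147        # merged SOTA val_bpb
run_bpb = 1.0846         # this run val_bpb

# Negative delta means this run is better (lower loss).
delta_nats = run_nats - sota_nats
delta_bpb = run_bpb - sota_bpb
clears_threshold = -delta_nats > threshold_nats

# Assumption: both metrics come from the same tokens and bytes, so the
# bpb/nats ratio should be (nearly) the same for both submissions.
ratio_sota = sota_bpb / sota_nats
ratio_run = run_bpb / run_nats

print(f"delta: {delta_nats:.5f} nats, {delta_bpb:.4f} bpb")
print(f"clears {threshold_nats}-nat threshold: {clears_threshold}")
print(f"bpb/nats ratio: SOTA {ratio_sota:.5f}, this run {ratio_run:.5f}")
```

If the two ratios diverged noticeably, that would suggest the nats and bpb numbers were not computed over the same data.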
Results (3-seed)
Changes from Merged SOTA (PR #1019)
1. Causal SLOT — provably causal eval-time delta optimization (Novel)
Standard SLOT (PRs #1172, #1176, #1229) optimizes delta using the loss from all positions, including future ones. PR #1240 proved this violates causal dependence (100% violation rate). Our causal SLOT restricts optimization to context-only positions, i.e. tokens already scored in previous windows. Provably causal: P(x_{t+1}) depends only on x_1,...,x_t. Delta: -0.009 BPB, ~300s eval time.
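The causal restriction can be pictured as an index partition over sliding evaluation windows. The following is a minimal sketch, not the code from this PR: `causal_slot_windows`, the window length and the stride are all hypothetical. It splits each window into positions whose loss may drive the delta update (already scored) and positions scored for the first time.

```python
def causal_slot_windows(seq_len, window, stride):
    """Yield (start, optimize_idx, score_idx) for each sliding eval window.

    optimize_idx: positions already scored in earlier windows; only their
                  loss would be allowed to update delta (hypothetical rule).
    score_idx:    positions scored for the first time in this window.
    """
    assert 0 < stride <= window, "stride must not skip positions"
    scored_upto = 0  # positions [0, scored_upto) have already been scored
    start = 0
    while scored_upto < seq_len:
        end = min(start + window, seq_len)
        optimize_idx = list(range(start, min(scored_upto, end)))
        score_idx = list(range(max(start, scored_upto), end))
        yield start, optimize_idx, score_idx
        scored_upto = end
        start += stride


if __name__ == "__main__":
    for start, opt, score in causal_slot_windows(seq_len=10, window=4, stride=2):
        print(f"window@{start}: optimize on {opt}, then score {score}")
```

In this picture, standard SLOT would correspond to optimizing over `optimize_idx + score_idx`, which is where loss from not-yet-scored positions enters. In the first window `optimize_idx` is empty, so delta would stay at its initial value there.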
2. Pre-quant AdamW TTT (6 epochs)
AdamW TTT on full-precision EMA weights before GPTQ quantization. Post-quant SGD TTT fails on GPTQ stacks (25 failures per PR #756). Pre-quant TTT adapts weights that then quantize better. Delta: -0.022 BPB, 111s.
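The ordering described above (adapt in full precision first, quantize afterwards) can be illustrated on a toy problem. This is a sketch under heavy assumptions: a single scalar weight, a hand-written AdamW update, full-batch epochs, and plain round-to-nearest quantization standing in for GPTQ. None of the names or hyperparameters except the 6 epochs come from this PR.

```python
import math


def loss_and_grad(w, data):
    """Mean squared error of y ~ w * x and its gradient w.r.t. w."""
    n = len(data)
    loss = sum((w * x - y) ** 2 for x, y in data) / n
    grad = sum(2 * (w * x - y) * x for x, y in data) / n
    return loss, grad


def adamw_adapt(w, data, epochs=6, lr=0.1, betas=(0.9, 0.999),
                eps=1e-8, weight_decay=0.01):
    """Run `epochs` full-batch AdamW steps on a scalar weight."""
    b1, b2 = betas
    m = v = 0.0
    for t in range(1, epochs + 1):
        _, g = loss_and_grad(w, data)
        w *= 1.0 - lr * weight_decay          # decoupled weight decay
        m = b1 * m + (1 - b1) * g             # first-moment estimate
        v = b2 * v + (1 - b2) * g * g         # second-moment estimate
        m_hat = m / (1 - b1 ** t)             # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def quantize(w, scale=0.05):
    """Round-to-nearest onto a uniform grid (a stand-in, NOT GPTQ)."""
    return round(w / scale) * scale


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy data, true weight is 2.0
w_fp = 0.0                                     # "full-precision" starting weight
loss_before, _ = loss_and_grad(w_fp, data)
w_adapted = adamw_adapt(w_fp, data)            # step 1: adapt before quantizing
loss_after, _ = loss_and_grad(w_adapted, data)
w_quant = quantize(w_adapted)                  # step 2: quantize adapted weight
print(loss_before, loss_after, w_adapted, w_quant)
```

The toy only shows the pipeline order. It does not demonstrate the claim that adapted weights quantize better, and it says nothing about which data the adaptation is allowed to see.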
3. Coprime-stride multi-shard data loader
Weighted random shard sampling with coprime stride patterns for batch diversity. Delta: -0.003 BPB.
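One plausible reading of "coprime stride" is sketched below. The idea is that stepping through a shard of n sequences with a stride coprime to n visits every index exactly once, in a scattered order. The exact scheme in PR #1184 is not shown here, so the function names, the size-proportional shard weights and the seed handling are all assumptions.

```python
import math
import random


def coprime_stride(n, rng):
    """Pick a stride in [1, n) with gcd(stride, n) == 1 (1 if n == 1)."""
    if n == 1:
        return 1
    while True:
        stride = rng.randrange(1, n)
        if math.gcd(stride, n) == 1:
            return stride


def shard_order(n, rng):
    """Visit all n indices of a shard once, stepping by a coprime stride."""
    stride = coprime_stride(n, rng)
    start = rng.randrange(n)
    return [(start + k * stride) % n for k in range(n)]


def sample_batches(shard_sizes, num_batches, seed=0):
    """Return (shard, index) pairs; shards are drawn with size-based weights."""
    rng = random.Random(seed)
    orders = [shard_order(n, rng) for n in shard_sizes]
    cursors = [0] * len(shard_sizes)
    picks = []
    for _ in range(num_batches):
        # Assumption: larger shards are sampled proportionally more often.
        shard = rng.choices(range(len(shard_sizes)), weights=shard_sizes, k=1)[0]
        idx = orders[shard][cursors[shard] % shard_sizes[shard]]
        cursors[shard] += 1
        picks.append((shard, idx))
    return picks


if __name__ == "__main__":
    print(shard_order(12, random.Random(0)))
    print(sample_batches([12, 7, 20], num_batches=8))
```

Because the stride is coprime to the shard length, the walk is a permutation of the shard, so no sequence repeats within a shard until every sequence in it has been used once.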
Reproduction
No env vars needed. FA3 required (see requirements.txt).
Credits
Base: PR #1019 (@abaybektursun). SLOT concept: arXiv:2505.12392v2, PR #1176 (@bigbag). Coprime-stride loader: PR #1184 (@icryo). Pre-quant TTT concept: PR #1006. Causal SLOT: novel (this submission).
Generated with Claude Code